The goal of this notebook is to identify other clusters of mutations using the same approach that was used to define the mutations in Genome-1 and Genome-2. Hopefully, these can be used to establish some base subclonal haplotypes.

Haplotyping subclonal mutations

I’ll try to use the same haplotyping approach that I used to cluster the mutations for genome-1 and genome-2. To do this, I’ll break the mutations down into frequency bands. I think this will work best for getting clusters of mutations.

Haplotypes found at 25% or more.

Can I link the SNPs that reach reasonably high frequency in some samples? I define these as reaching 25% or more in at least one tissue. However, I’m leaving out the mutations that are fixed in 10 or more samples; these should probably be included in the SSPE consensus.
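To make the selection rule concrete, here is a minimal Python sketch (not the notebook's own code). The 25% and 5% band edges and the "10 or more samples" rule come from the text; the `FIXED_AF` cutoff for calling a SNP fixed in a sample, the function names, and the toy data are all assumptions for illustration.

```python
# Assumed cutoff for calling a SNP "fixed" in one sample (not stated in the notebook).
FIXED_AF = 0.95
# From the text: SNPs fixed in 10 or more samples are left for the consensus.
FIXED_SAMPLES = 10


def peak_band(freqs):
    """Assign a SNP to a frequency band based on its highest allele frequency."""
    peak = max(freqs.values())
    if peak >= 0.25:
        return "high"  # reaches 25% or more in at least one tissue
    if peak >= 0.05:
        return "mid"   # peaks between 5% and 25%
    return "low"


def is_consensus_candidate(freqs, fixed_af=FIXED_AF, fixed_samples=FIXED_SAMPLES):
    """True if the SNP is fixed in enough samples to belong in the consensus."""
    return sum(af >= fixed_af for af in freqs.values()) >= fixed_samples


def select_band(snps, band):
    """Names of SNPs in the requested band, skipping consensus candidates."""
    return sorted(
        name
        for name, freqs in snps.items()
        if not is_consensus_candidate(freqs) and peak_band(freqs) == band
    )


# Toy data: four hypothetical SNPs across twelve hypothetical tissue samples.
tissues = [f"T{i}" for i in range(1, 13)]
snps = {
    "snp_a": {**{t: 0.02 for t in tissues}, "T3": 0.40},
    "snp_b": {t: 0.99 for t in tissues},
    "snp_c": {**{t: 0.01 for t in tissues}, "T7": 0.10},
    "snp_d": {t: 0.01 for t in tissues},
}

high_band = select_band(snps, "high")
mid_band = select_band(snps, "mid")
```

Here `snp_b` is dropped because it is fixed everywhere, `snp_a` lands in the high band, and `snp_c` lands in the 5% to 25% band.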

After toying around with different settings, I settled on a threshold of 25% or more and ~20 clusters.

Clearly, some of the clusters are mutations on G1 or G2 that are missing from a single tissue. This could be genuine, or it could be because of a caller-specific issue or low coverage in a given tissue. Also, some of the clusters with one SNP might be consensus mutations or mutations on both backgrounds that need to be resolved separately.

Now, I’ll look at the clusters that have multiple SNPs where the allele frequency goes above 25% in at least one tissue sample and that have the clearest correlations. I’ll also try to figure out which background they’re on based on the ‘pigeonhole principle’.
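One way to read the pigeonhole argument: if, in some tissue, the frequency of a SNP plus the frequency of a background adds up to more than 100%, then at least some genomes must carry both. The Python sketch below encodes that reading. It assumes G1 and G2 are mutually exclusive backgrounds, and the `tolerance` for sequencing noise, the function names, and the toy frequencies are hypothetical.

```python
def must_overlap(snp_af, background_af, tolerance=0.05):
    """Pigeonhole test: two frequencies summing past 100% must share genomes."""
    return snp_af + background_af > 1.0 + tolerance


def infer_background(snp, g1, g2, tolerance=0.05):
    """Classify a SNP as on G1, G2, both, or unresolved.

    Each argument maps tissue name to frequency; g1 and g2 are the
    frequencies of the two backgrounds in the same tissues.
    """
    on_g1 = any(must_overlap(snp[t], g1[t], tolerance) for t in snp)
    on_g2 = any(must_overlap(snp[t], g2[t], tolerance) for t in snp)
    if on_g1 and on_g2:
        return "both"
    if on_g1:
        return "G1"
    if on_g2:
        return "G2"
    return "unresolved"


# Toy frequencies in two hypothetical tissues.
g1 = {"A": 0.80, "B": 0.30}
g2 = {"A": 0.20, "B": 0.70}
linked_snp = {"A": 0.35, "B": 0.05}  # 0.35 + 0.80 > 1 in tissue A
rare_snp = {"A": 0.10, "B": 0.20}    # never exceeds 100% with either background

linked_call = infer_background(linked_snp, g1, g2)
rare_call = infer_background(rare_snp, g1, g2)
```

The test can only prove overlap, never exclusion, so low-frequency SNPs come back as "unresolved" and need read-level evidence instead.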

Now, I’ll work on the remaining clusters that are correlated, but might be broken down into further clusters.

Now, cluster 19 needs a little more work. I think it is actually two to three separate but related haplotypes.

Finally, cluster 4. These mutations probably can’t be haplotyped with this method because they are mostly found in Frontal Cortex 2 and missing everywhere else. I would hazard a guess that these are mutations on the background of genome-1-1, but this would need to be confirmed with read information.

Haplotypes found between 25% and 5%

Most of these will be impossible to haplotype. I’ll try for the ones that seem the most clear cut.

These all look fairly promising…

What’s left to cluster?

There are still a good number of mutations that need to be clustered: roughly 370. However, about half of these can’t physically be haplotyped because they are at too low a frequency. Several of these are also probably only in a single sample.

## [1] "There are about 374 that are still totally un-haplotyped."
## [1] "There are  121 that have at least been clustered."
## [1] "There are still about 189 that can reasonably be haplotyped."
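The exact rule behind "can reasonably be haplotyped" is not shown, so here is a hedged Python sketch of one plausible rule: a mutation needs to be detected in at least two samples and to reach a minimum peak frequency. Both thresholds (`DETECTION_AF`, `MIN_PEAK_AF`), the function names, and the toy data are assumptions, and the counts it produces are unrelated to the numbers printed above.

```python
DETECTION_AF = 0.01  # assumed minimum frequency to count a mutation as present
MIN_PEAK_AF = 0.05   # assumed minimum peak frequency (lower edge of the 5-25% band)


def can_be_haplotyped(freqs, detection_af=DETECTION_AF, min_peak_af=MIN_PEAK_AF):
    """True if a mutation is seen in 2+ samples and peaks high enough to track."""
    detected = [af for af in freqs if af >= detection_af]
    return len(detected) >= 2 and max(freqs, default=0.0) >= min_peak_af


def tally(remaining):
    """Count remaining mutations that can and cannot be haplotyped."""
    ok = sum(can_be_haplotyped(freqs) for freqs in remaining.values())
    return {"haplotypable": ok, "not_haplotypable": len(remaining) - ok}


# Toy data: per-sample allele frequencies for three hypothetical mutations.
remaining = {
    "m1": [0.08, 0.06, 0.00],  # two samples, peak above 5%
    "m2": [0.30, 0.00, 0.00],  # high, but only in a single sample
    "m3": [0.02, 0.03, 0.01],  # widespread, but too low in frequency
}

counts = tally(remaining)
```

A mutation confined to one sample has no cross-tissue frequency profile to correlate, which is why `m2` is excluded despite its high peak.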

What positions should be haplotyped?

Given what has been haplotyped with this method so far, which region is the best one to attempt to haplotype?

END